System and method for confidentiality-preserving rank-ordered search

ABSTRACT

A confidentiality preserving system and method for performing a rank-ordered search and retrieval of contents of a data collection. The system includes at least one computer system including a search and retrieval algorithm using term frequency and/or similar features for rank-ordering selective contents of the data collection, and enabling secure retrieval of the selective contents based on the rank-order. The search and retrieval algorithm includes a baseline algorithm, a partially server oriented algorithm, and/or a fully server oriented algorithm. The partially and/or fully server oriented algorithms use homomorphic and/or order preserving encryption for enabling search capability from a user other than an owner of the contents of the data collection. The confidentiality preserving method includes using term frequency for rank-ordering selective contents of the data collection, and retrieving the selective contents based on the rank-order.

CROSS-REFERENCE TO RELATED APPLICATION(S)

This application claims the benefit of provisional patent application U.S. Ser. No. 61/109,291, filed Oct. 29, 2008, which is expressly incorporated herein by reference.

GOVERNMENT SUPPORT CLAUSE

This invention was made with government support under H9823005C0425 awarded by NSA. The government has certain rights in the invention.

BACKGROUND OF INVENTION

a. Field of Invention

This invention relates to information search and retrieval. In particular, the instant invention relates to a system and method for information search and retrieval in large-scale encrypted databases, with a particular embodiment employing a confidentiality-preserving rank-ordered search.

b. Background Art

In today's information era, efficient and effective search capability of digital collections is essential in information management and knowledge discovery. At the same time, many data collections have to be stored in an encrypted form to limit their access to only authorized users in order to protect confidentiality and privacy. Examples of such data collections include medical records, corporate proprietary communications, and classified government documents. An emerging critical issue that must be addressed is how to protect data collections and indexes through encryption, while simultaneously providing efficient and accurate search capabilities.

A known method of protecting data from theft or intrusion is cryptographic encryption. If the contents of a data storage system are not encrypted, any outsider intruding into the system may gain knowledge of the data content. In addition to such outsider attacks, security measures must also be taken against potential insider attacks. For example, when data storage is outsourced to a third-party data center, system administrators and other personnel involved may not be trusted to have decryption keys and thus have access to the content of the data collections. When an authorized user remotely accesses the data collection to search and retrieve desired documents, the large size of the collections can often make it infeasible to transfer all encrypted data to the user's side, and then perform decryption and search on the user's trusted computers. Therefore, new techniques are needed to encrypt and organize data collections in such a way as to allow the data center to perform effective and efficient search in encrypted data.

A number of scenarios exist where the content owner may want to grant a user limited access to search a confidential collection. For example, the searcher may be a scholar or a low-level analyst who wants to identify relevant documents from a private/classified collection, and may need clearance only for the top-ranked documents; the searcher may also be the opposing party during the document discovery phase of a litigation, who would request that relevant documents from the content owner's digital collection (e.g. e-mails) be turned in. Conventional practices to accommodate such searches on hard-copy collections are extremely time consuming, and are often based on human factors (e.g. limited memory and rules of privilege) that cannot all be directly extended to computerized practice. New algorithms and processes are thus needed to enable secure search for a variety of applications.

There has been a considerable amount of prior work on algorithms and data structures to support information retrieval. The vast majority of such work has focused on efficient representation and effective ranking. There has been only minimal effort in addressing secure searching, and such effort has typically been limited to small collections. One example of a search in encrypted data and private information retrieval includes using established cryptographic tools as building blocks, and devising an encryption method that makes two subparts of each encrypted term in a document hold a special relationship, which allows for determination of the presence or absence of a query term in an encrypted document. This method still incurs a significant increase in storage (for storing the specially encrypted documents) and typically involves a linear time computational complexity with respect to the number of words in the collection.

Keyword based approaches to reduce search complexity have been introduced at the expense of limited search capabilities confined by a keyword list identified beforehand. The documents containing some of the keywords are first identified, and the keywords or the keyword indices are encrypted in a way that facilitates search and retrieval. Secure indices based on Bloom filters have also been proposed to further enhance search efficiency, and conjunctive keyword based searches have been investigated.

The aforementioned techniques involve a high computational complexity, and target simple Boolean searches to identify the presence or absence of a term in encrypted text. Furthermore, the aforementioned techniques cannot be easily extended to more sophisticated relevance-ranked searches over large collections.

The inventors herein have thus recognized the need for balancing privacy and confidentiality with efficiency and accuracy, which pose significant challenges to the design of search schemes for a number of search scenarios and large data collections. The inventors herein have also recognized the need for a system that focuses on secure and efficient rank-ordered search and retrieval over large data collections.

BRIEF SUMMARY OF THE INVENTION

The confidentiality preserving rank-ordered search system and method of the invention focuses on secure and efficient rank-ordered search and retrieval over large data collections. The system includes a framework to securely rank-order documents in response to a query, and techniques for extracting the most relevant document(s) from an encrypted data collection. The system and method includes collection of term frequency information for each of the documents in the collection to build indices, as in traditional retrieval systems in plaintext. The system and method further includes securing of these indices that would otherwise reveal important statistical information about the collection to protect against statistical attacks. During the search process, the query terms may be encrypted to prevent the exposure of information to the data center and other intruders, and also confine the searching entity to only make queries within an authorized scope. Utilizing the term frequencies and other document information, schemes are developed herein to securely compute relevance scores of each document, identify the most relevant documents, and reserve the right to screen and release the full content of relevant documents.

For the system and method of the invention, the proposed framework is built upon well-studied cryptographic encryption and hashing primitives. The system achieves search accuracy comparable to that of conventional searching systems designed for non-encrypted data. In addition to the focus on securing the indexes and ranking, other security issues such as protecting communication links and combating traffic analysis are addressed by appropriate security protocols and randomization.

In an exemplary embodiment, the invention provides a confidentiality preserving system for performing a rank-ordered search and retrieval of contents of a data collection. The system may include a computer system including a search and retrieval algorithm using term frequency and/or similar features for rank-ordering selective contents of the data collection, and enabling secure retrieval of the selective contents based on the rank-order.

For the confidentiality preserving system described above, in an embodiment, the search and retrieval algorithm may generate a relevance score for the rank-ordering based on one or more queries. In an embodiment, the data collection and/or query may be encrypted. The data collection may include documents and/or multi-media content. The search and retrieval algorithm may include three algorithms: a baseline algorithm, a partially server oriented algorithm, and a fully server oriented algorithm.

In an embodiment, the baseline algorithm may include a pre-processing algorithm for building a secure term frequency table and an inverse data collection frequency table, and a search stage algorithm for rank-ordering in response to a query. The pre-processing algorithm may include stemming of selective components of the contents of the data collection and mapping of the stemmed components in the term frequency table. The selective components may be words, and the data collection contents may be documents. In an embodiment, the search stage algorithm may include stemming of a query term, searching of the term frequency table, generation of a relevance score, rank ordering of the selective contents of the data collection based on the relevance score, and retrieval of the selective contents of the data collection based on the rank order. The pre-processing and search stage algorithms may be executed at a user site remote from a data center for storing the data collection.

In an embodiment, the partially server oriented algorithm may include performance of selective computations at a user site remote from a data center for storing the data collection. The partially server oriented algorithm may include building of a term frequency table and/or generation of a relevance score at a user site remote from a data center for storing the data collection.

In an embodiment, the fully server oriented algorithm may include building of a term frequency table at a user site, and generation of a relevance score at a secure computing unit and/or a data center for storing the data collection.

In an embodiment, the partially and/or fully server oriented algorithms may enable search capability from a user other than an owner of the contents of the data collection.

The invention also provides a confidentiality preserving method for performing a rank-ordered search and retrieval of contents of a data collection. The method may include using term frequency and/or similar features for rank-ordering selective contents of the data collection, and securely retrieving the selective contents based on the rank-order.

For the method described above, in an embodiment, the method may further include generating a relevance score for the rank-ordering based on at least one query. The method may further include encrypting the data collection and/or query. In an embodiment, the data collection may include documents and/or multi-media content.

For the method described above, the method may further include building a secure term frequency table and an inverse data collection frequency table by stemming of selective components of the contents of the data collection and mapping of the stemmed components in the term frequency table. In an embodiment, the selective components may include words, and the data collection contents may include documents. The term frequency table may be generated at a user site remote from a data center for storing the data collection.

For the method described above, the method may further include stemming of a query term, searching of a term frequency table, generation of a relevance score, rank ordering of the selective contents of the data collection based on the relevance score, and retrieval of the selective contents of the data collection based on the rank order. In an embodiment, generation of the relevance score and rank ordering may be performed at a user site remote from a data center for storing the data collection. In an embodiment, the term frequency table and relevance score may be selectively generated at a user site remote from a data center for storing the data collection, and/or at a data center for storing the data collection.

For the method described above, the method may include using homomorphic encryption and/or order preserving encryption for enabling search capability from a user other than an owner of the contents of the data collection.

Additional features, advantages, and embodiments of the invention may be set forth or apparent from consideration of the following detailed description, drawings, and claims. Moreover, it is to be understood that both the foregoing summary of the invention and the following detailed description are exemplary and intended to provide further explanation without limiting the scope of the invention as claimed.

BRIEF DESCRIPTION OF THE DRAWINGS

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate preferred embodiments of the invention and, together with the detailed description, serve to explain the principles of the invention. In the drawings:

FIG. 1 is a diagram illustrating the confidentiality-preserving rank-ordered search system and method of the invention;

FIG. 2 is a diagram illustrating the generation and securing of index information;

FIG. 3 is a diagram illustrating search and retrieval for a confidentiality-preserving baseline model scheme according to the invention;

FIG. 4 is a diagram illustrating search and retrieval in a fully server oriented scheme according to the invention;

FIGS. 5A and 5B are examples of term frequency histograms, and FIGS. 5C and 5D are the corresponding histograms of the encrypted term frequency values;

FIG. 6 is a diagram illustrating the partially server oriented scheme according to the invention;

FIG. 7 is a precision-recall graph for the baseline scheme and the order-preserving encryption scheme according to the invention;

FIG. 8 is a graph illustrating the difference in Mean Average Precision (MAP) between the baseline and order-preserving encryption schemes according to the invention;

FIG. 9 is a scatter plot of Mean Average Precision (MAP) values for the order-preserving encryption scheme with a different mapping table for each row of a TF table, plotted with respect to the baseline scheme; and

FIG. 10 is a graph illustrating use of a modified Kendall distance measure for comparing top 20 and top 100 ranks obtained using the baseline and order-preserving encryption schemes according to the invention.

DETAILED DESCRIPTION OF EMBODIMENTS OF THE INVENTION

Referring now to the drawings wherein like reference numerals are used to identify identical components and steps in the various views, an embodiment of the confidentiality preserving rank-ordered search system and method (hereinafter the “confidentiality preserving system” or “confidentiality preserving method”) will be described in detail.

Before proceeding with a detailed description of the confidentiality preserving system and method of the invention, exemplary use-cases will be described for facilitating an understanding of the invention. It should be noted that the use-cases are for exemplary purposes only and should by no means be used to limit the scope of the invention.

Scenarios of Secure Search

This section discusses representative scenarios where the secure search over a document collection may take place. FIG. 1 is a diagram illustrating the confidentiality-preserving rank-ordered search system and method of the invention. Referring to FIG. 1, a content owner 100 (e.g. a supervisor) uses the services of a data center 102 to store a large number of documents, as well as to perform search and retrieval. The content owner may also grant another user 104 the permission to search and retrieve his/her documents through the data center. Additionally, to prevent leakage of information in the event of a hacker attack, the documents stored at the data center are encrypted at location 106. The content owner manages the content decryption keys and may provide decryption services upon the user's request. In the following discussion, a few application scenarios will be examined under this framework.

Case 1: The content owner wants to search for some documents stored at the data center. He/she has a limited bandwidth connection with the data center, and needs to search through the encrypted content without downloading the entire collection. Furthermore, the content owner does not trust the data center with his/her unencrypted content. He/she wants to remotely search and retrieve top-ranked relevant documents without revealing the search terms, document content, and/or document index information to the data center. This scenario will be referred to as the confidentiality preserving baseline model, as discussed below, where the scheme enables both the confidentiality protection and the use of term frequency (discussed below) to achieve secure and efficient retrieval.

Case 2: Next, consider the scenario where a user, who is not the content owner, wants to search for a particular phrase in the set of confidential documents held by the data center. This scenario may arise in a number of cases, for example, where the user may be a scholar or a low-level analyst who wants to search relevant documents from a private/classified collection, and may need clearance only for the top-ranked documents. The user may also be the opposing side in a litigation requesting that relevant documents from a digital collection (e.g. e-mails) be turned in by the content owner's side. In general, the content owner does not trust the data center with the document content or the term frequency values. However, it is considered herein that the data center has a secure computing unit (SCU), which is trusted by the content owner to some degree. Depending on the level of trust placed in the SCU by the content owner, the following exemplary scenarios are identified:

Case 2a: the content owner trusts the SCU both with the plain-text documents and the associated term-frequency table (discussed below).

Case 2b: the content owner trusts the SCU with the plain-text term-frequency values, but not with the plain-text documents.

Case 2c: the content owner does not trust the SCU with either the term-frequency values or the documents in plain-text form, but trusts the SCU with certain computations to be performed on some encrypted version of the term-frequency (TF) table without disclosing the exact values.

In Cases 2a and 2b, the content owner trusts the SCU with the term frequency values. In these cases, the SCU can be considered as a heavily guarded “Maximum-Security Computing Unit” (MaxSCU) in the data center that can be used to decrypt the term frequency (TF) table, compute relevance scores using EQ-1 (see below), and rank-order the documents based on these values. The baseline model introduced under the Confidentiality Preserving Baseline Model section can be the solution under this scenario. The MaxSCU, however, is a critical link of the overall system security and may be subject to heavy attacks, and as such, it can be expensive to design and maintain such a unit hosted in a data center.

In Case 2c, the threat of adversaries breaking the SCU is alleviated, as the SCU only sees some encrypted index data and not the exact plain-text values. As such, an SCU with medium security (MedSCU) can be sufficient. This scenario calls for two layers of carefully designed encryptions to allow the SCU to compute relevance scores in the encrypted domain of the first layer and enhance confidentiality outside the SCU with an outer-layer encryption. Two exemplary schemes to accomplish this objective, homomorphic encryption (HME) and order-preserving encryption (OPE), are discussed in the Secure Ranking of Document Relevance section presented below.

If the content owner does not trust the SCU with any plain-text or encrypted data, the content owner's involvement would be required in computing the relevance score. The scenario would thus reduce to the baseline model discussed in the Confidentiality Preserving Baseline Model section presented below.

Because term frequency statistics of a collection are useful for ranked retrieval, these concepts will be briefly discussed before proceeding with a detailed description of the aforementioned baseline model and the fully and partially server oriented schemes, to facilitate development of the proposed schemes.

Term Frequency

Referring to FIG. 1, consider a data collection 108 that contains N^((D)) documents, in which N^((T)) unique terms appear. The term frequency information for all terms and all documents can be organized as a table at location 110 of size N^((T))×N^((D)), in which the entry at the i^(th) row and j^(th) column indicates the number of occurrences of the i^(th) term in the j^(th) document. Term frequency has been employed as a core variable to define the relevance score in rank-ordering documents in a collection. One example metric is the Okapi relevance score CW (i, j), which is defined as:

$CW(i,j) = \frac{CFW(i)\,TF(i,j)\,\left(K_{1} + 1\right)}{K_{1}\left(1 - b + b \cdot NDL(j)\right) + TF(i,j)}, \qquad (\text{EQ-1})$

where N(i) is the number of documents containing the i^(th) term; NDL(j) represents the normalized length of the j^(th) document and is given by dividing the length of the j^(th) document, L(j), by the average document length L_(avg), i.e., NDL(j)=L(j)/L_(avg); and K₁ and b are constants chosen to achieve the best performance for the particular collection (see S. E. Robertson and K. S. Jones, “Simple Proven Approaches to Text Retrieval,” Technical Report TR356, Cambridge University Computer Laboratory, 1997). Exemplary values are K₁=2 and b=0.75. CFW (i) denotes the cumulative frequency of the i^(th) word in the whole collection and is given by CFW (i)=log(N^((D))/N(i)). The CFW plays a role equivalent to that of the inverse document frequency used in some information retrieval schemes. It can be either pre-computed or obtained concurrently from the term frequency table.
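As an illustration only, EQ-1 can be evaluated for a single term and document as in the following minimal Python sketch. The function name `okapi_weight` and its parameter names are hypothetical, and the natural logarithm is an assumption, since the text does not state the base (a different base scales every score by the same factor and leaves the ranking unchanged).

```python
import math

def okapi_weight(tf, n_docs, n_docs_with_term, doc_len, avg_doc_len, k1=2.0, b=0.75):
    """Relevance score CW(i, j) of EQ-1 for one term i and one document j.

    tf               -- TF(i, j), occurrences of term i in document j
    n_docs           -- N^(D), number of documents in the collection
    n_docs_with_term -- N(i), number of documents containing term i
    doc_len          -- L(j), length of document j
    avg_doc_len      -- L_avg, average document length
    """
    if tf == 0 or n_docs_with_term == 0:
        return 0.0                                   # term absent: no contribution
    cfw = math.log(n_docs / n_docs_with_term)        # CFW(i) = log(N^(D) / N(i)), natural log assumed
    ndl = doc_len / avg_doc_len                      # NDL(j) = L(j) / L_avg
    return cfw * tf * (k1 + 1) / (k1 * (1 - b + b * ndl) + tf)
```

With the exemplary constants K₁=2 and b=0.75 and a document of average length, the score is CFW(i)·3·TF/(2+TF), so it grows with the term frequency but saturates below 3·CFW(i).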

Given a query consisting of a single term w(i), the set of relevance scores {CW (i, j), j=1, . . . , N^((D))} can be directly used to identify the most relevant documents, which have the largest relevance scores over the above set {CW (i, j), j=1, . . . , N^((D))}. If a query contains multiple terms {w(i₁), w(i₂), . . . , w(i_(M))}, the relevance scores for each of the query terms are added, i.e.,

$\left\{ \sum\limits_{k=1}^{M} CW\left(i_{k}, j\right),\ \forall j \right\},$

and this overall score vector is employed to rank-order the documents. The term frequency table and indices may be secured at location 112.
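The summation over query terms and the subsequent rank-ordering can be sketched as follows. This is a hedged plaintext illustration: the function name `rank_documents`, the list-of-rows input layout, and the natural logarithm are assumptions made for the example, not details taken from the specification.

```python
import math

def rank_documents(tf_rows, doc_lengths, k1=2.0, b=0.75):
    """Sum per-term Okapi scores (EQ-1) and rank documents by the total.

    tf_rows     -- one TF row per query term; tf_rows[m][j] = TF(i_m, j)
    doc_lengths -- L(j) for every document j
    Returns (order, scores): document indices from most to least relevant,
    and the overall score vector.
    """
    n_docs = len(doc_lengths)
    avg_len = sum(doc_lengths) / n_docs
    scores = [0.0] * n_docs
    for row in tf_rows:
        n_i = sum(1 for tf in row if tf > 0)         # N(i): documents containing the term
        if n_i == 0:
            continue                                 # term never occurs: contributes nothing
        cfw = math.log(n_docs / n_i)                 # CFW(i), natural log assumed
        for j, tf in enumerate(row):
            ndl = doc_lengths[j] / avg_len           # NDL(j)
            scores[j] += cfw * tf * (k1 + 1) / (k1 * (1 - b + b * ndl) + tf)
    order = sorted(range(n_docs), key=lambda j: scores[j], reverse=True)
    return order, scores
```

For example, `rank_documents([[3, 0, 1, 0], [0, 0, 2, 0]], [100, 100, 100, 100])` places document 2 first, because it matches both query terms, followed by document 0.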

The confidentiality preserving baseline model, and fully and partially server oriented schemes will now be discussed in detail in the following sections.

Approach/Scheme I—Confidentiality Preserving Baseline Model

As discussed above, the confidentiality preserving system and method of the invention includes a unique framework for performing ranked search securely and efficiently without revealing the indexing information. For the baseline scheme, it is assumed that the data center can only be trusted with data storage and should not be allowed to obtain any information about the stored data. To achieve secure search, the baseline model is proposed that involves multiple rounds of interaction between the client and server to obtain the relevant information pertaining to a query. It should be noted that various aspects of the fully and partially server oriented schemes will also be discussed in conjunction with the baseline model to provide a full understanding of the invention. The proposed framework may include two major stages, a pre-processing stage for building a secure term frequency table and an inverse document frequency table, and a search stage for rank-ordering documents in response to a particular query while preserving the confidentiality of term frequency information.

Indexing Stage to Secure Term Frequency

The pre-processing is executed once by the content owner, when he/she stores the documents, all in encrypted form, in the data center. The major task of the pre-processing stage is to build a secure term frequency table and an inverse document frequency table, so as to facilitate efficient and accurate information retrieval.

For an unprotected term frequency table, both the search term and its term frequency information are in plain text. To protect the confidentiality of the search, both of them may be encrypted in an appropriate way. FIG. 2 is a diagram illustrating the generation and securing of index information for the baseline model. Referring to FIG. 2, a word w in a document first undergoes stemming at location 130 to retain the word root while removing the word ending to obtain w_(S). The stemmed word may then be encrypted at location 132 using an encryption function E and the word-key K_(w_S), obtaining the encrypted word w_(S) ^((e))=E(K_(w_S), w_(S)). The word-key may be unique to each stemmed word and is obtained using the stemmed word and a pre-defined master key. The encrypted word w_(S) ^((e)) is further mapped to a particular row i in the term frequency table, where the index i is established via a hashing function at location 134 such that i=H(w_(S) ^((e))). With the stemmed word, the term frequency information is collected by counting the number of occurrences of the stemmed word in the j^(th) document and stored in the table entry {TF(i, j)} at location 136.
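The chain from word to table row (stemming, word-key derivation, encryption, hashing) can be sketched as below. Everything concrete here is an assumption made for illustration: the specification does not name the stemmer, the key-derivation method, the encryption function E, or the hash H. The sketch uses a toy suffix-stripping stemmer, HMAC-SHA256 for key derivation, HMAC-SHA256 again as a deterministic keyed stand-in for E (it is not invertible, which suffices for locating a row but is not a real cipher), and SHA-256 reduced modulo the table size for H. Hash collisions between distinct terms are ignored.

```python
import hashlib
import hmac

def stem(word):
    """Toy stemmer (hypothetical): lower-case and strip a few common suffixes."""
    w = word.lower()
    for suffix in ("ing", "ed", "es", "s"):
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w

def word_key(master_key, stemmed):
    """Derive the word-key K_(w_S) from the master key and the stemmed word."""
    return hmac.new(master_key, b"word-key:" + stemmed.encode(), hashlib.sha256).digest()

def encrypt_word(key, stemmed):
    """Stand-in for E(K_(w_S), w_S): a deterministic keyed transform, not a real cipher."""
    return hmac.new(key, stemmed.encode(), hashlib.sha256).digest()

def row_index(encrypted_word, n_rows):
    """i = H(w_S^(e)), reduced to a row number of the term frequency table."""
    digest = hashlib.sha256(encrypted_word).digest()
    return int.from_bytes(digest[:8], "big") % n_rows

def locate_row(master_key, word, n_rows):
    """Full chain: word -> stemmed word -> encrypted word -> row index."""
    ws = stem(word)
    return row_index(encrypt_word(word_key(master_key, ws), ws), n_rows)
```

Because every step is deterministic, word forms that share a stem (for example "searching" and "searched") map to the same row, while a party without the master key cannot recompute the mapping.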

This process is repeated to obtain the term frequencies for all terms and documents, which are then further encrypted. In the baseline model discussed herein, where the data center can only be trusted with storing data, a single layer of encryption is sufficient to protect the term frequency information from both unauthorized users and from the data center. The term frequency information, i.e., TF^((s))(i, j)=TF (i, j), is directly used at location 138. If needed, proper encoding can be performed to minimize the required storage. The encoded term frequency table denoted by TF_(C) ^((s)) is then encrypted to create TF_(C) ^((e)) at location 140 as follows:

$TF_{C}^{(e)}(i,\cdot) = E\left(K_{i}^{(TF)},\ TF_{C}^{(s)}(i,\cdot)\right) \qquad (\text{EQ-2})$

Here, TF_(C) ^((s))(i,.)=C(TF^((s))(i,.)) represents the encoded term frequency values obtained through an encoding function C that removes redundancies in the term frequency table. K_(i) ^((TF)) denotes the key used to encrypt the i^(th) row of the term frequency table TF^((s)). To increase the security, the value of K_(i) ^((TF)) is unique for each row and is derived from the word-key K_(w_S) corresponding to the i^(th) row. Thus, even if the key corresponding to one row is compromised, no information can be obtained about other rows of the term frequency table.
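A minimal sketch of per-row keys and row encryption in the spirit of EQ-2 follows. The specification does not fix a cipher or a key-derivation method, so both are assumptions here: the row key is derived with HMAC-SHA256, and E is replaced by an XOR with an HMAC-derived keystream. This stand-in uses no nonce and no authentication and is for illustration only; a deployed system would use a vetted authenticated cipher.

```python
import hashlib
import hmac

def row_key(word_key):
    """Derive K_i^(TF) from the word-key of row i (the derivation label is hypothetical)."""
    return hmac.new(word_key, b"tf-row-key", hashlib.sha256).digest()

def _keystream(key, length):
    """Expand a key into `length` pseudo-random bytes (HMAC-SHA256 in counter mode)."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hmac.new(key, counter.to_bytes(8, "big"), hashlib.sha256).digest()
        counter += 1
    return bytes(out[:length])

def encrypt_row(key, encoded_row):
    """Stand-in for E(K_i^(TF), TF_C^(s)(i,.)): XOR the encoded row with the keystream."""
    stream = _keystream(key, len(encoded_row))
    return bytes(a ^ b for a, b in zip(encoded_row, stream))

# XOR with the same keystream undoes the encryption.
decrypt_row = encrypt_row
```

Since each row has its own key, decrypting a row with the key of another row yields unrelated bytes, which mirrors the statement that compromising one row key reveals nothing about other rows.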

Secure Search Stage

In the baseline model discussed herein, search and retrieval is initiated by the content owner. FIG. 3 is a diagram illustrating search and retrieval for the confidentiality-preserving baseline model scheme. Referring to FIG. 3, when searching for a particular word w in the collection, the content owner first performs stemming at location 170 to obtain the stemmed word w_(S). The word-key is then derived from the master key and used to encrypt the stemmed word w_(S) to obtain w_(S) ^((e)). After that, the hash value of w_(S) ^((e)) is calculated at location 172 and sent to the data center. Using the received hash value k=H(w_(S) ^((e))), the data center searches the protected term frequency table TF_(C) ^((e)) at location 174 and identifies the row corresponding to the query word w. In this way, the query content is concealed from the data center.

After the data center identifies the target row TF_(C) ^((e))(k,.) from the encrypted term frequency table TF_(C) ^((e)) based on the calculated value of k=H(w_(S) ^((e))), that particular row TF_(C) ^((e))(k,.) is sent back to the content owner, who then decrypts and decodes at location 176 to obtain the plain-text term frequencies {TF(k, j)∀j}. The content owner further computes relevance scores at location 178 from the term frequency values as in EQ-1, rank-orders the documents based on the score, and requests the most relevant documents from the data center at locations 180, 182. When a query consists of multiple terms, w(i₁), w(i₂), . . . , w(i_(M)), the M corresponding rows in the TF table are identified, TF_(C) ^((e))(i₁,.), TF_(C) ^((e))(i₂,.), . . . , TF_(C) ^((e))(i_(M),.), and sent back to the content owner for computing relevance scores. The content owner uses the received information to compute the relevance scores for each term, and then combines them to obtain the final score.
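The exchange between content owner and data center in the baseline scheme can be simulated as in the sketch below, under stated assumptions. The class `DataCenter`, the helper names, the demo master key, the JSON row encoding, and the HMAC-based stand-ins for E and for key derivation are all hypothetical choices for this example and are not taken from the specification. Stemming and the relevance computation of EQ-1 are omitted.

```python
import hashlib
import hmac
import json

MASTER_KEY = b"demo-master-key"          # hypothetical; held only by the content owner

def _prf(key, msg):
    return hmac.new(key, msg, hashlib.sha256).digest()

def _xor_stream(key, data):
    """Toy stream cipher standing in for E; XOR is its own inverse."""
    stream = b""
    counter = 0
    while len(stream) < len(data):
        stream += _prf(key, counter.to_bytes(8, "big"))
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))

def query_token(stemmed_word):
    """Owner side: derive the word-key and the hash value k that is sent to the server."""
    k_w = _prf(MASTER_KEY, stemmed_word.encode())
    w_enc = _prf(k_w, b"enc:" + stemmed_word.encode())     # stand-in for w_S^(e)
    return k_w, hashlib.sha256(w_enc).hexdigest()          # k = H(w_S^(e))

class DataCenter:
    """Stores only encrypted rows, addressed by the hash value k."""
    def __init__(self):
        self.rows = {}
    def store(self, k, encrypted_row):
        self.rows[k] = encrypted_row
    def lookup(self, k):
        return self.rows.get(k)

def owner_store(dc, stemmed_word, tf_row):
    """Pre-processing: encode, encrypt with the row key, and upload one TF row."""
    k_w, k = query_token(stemmed_word)
    dc.store(k, _xor_stream(_prf(k_w, b"row"), json.dumps(tf_row).encode()))

def owner_search(dc, stemmed_word):
    """Search stage: send k, receive the encrypted row, decrypt and decode locally."""
    k_w, k = query_token(stemmed_word)
    blob = dc.lookup(k)
    if blob is None:
        return None
    return json.loads(_xor_stream(_prf(k_w, b"row"), blob).decode())
```

In this simulation the data center only ever handles the value k and the encrypted row, while the plain-text term frequencies appear on the owner side alone, where the relevance scores would then be computed.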

The baseline model and the fully and partially server oriented schemes, each discussed in detail herein and below, differ in where these rows are processed. In the baseline scheme, all of these term frequency rows are sent back to the user side for computing relevance scores using the combined information. In the partially server oriented scheme, after the term frequency rows TF_(C) ^((e))(i₁,.), TF_(C) ^((e))(i₂,.), . . . , TF_(C) ^((e))(i_(M),.) go through outer-layer decryption and decompression, the server performs part of the combination, which is then sent back to the user side for obtaining the final relevance scores. In the fully server oriented scheme, after the outer-layer decryption and decompression on all the M related term frequency rows, the server computes relevance scores for each of them, and then performs the combination to obtain the final scores.

TABLE I Comparison of the Proposed Techniques

| Property | Baseline | Partially Server Oriented | Fully Server Oriented |
| --- | --- | --- | --- |
| No. of communication rounds | 2 | 2 | 1 |
| Bandwidth requirement for communication | High | Medium | Low |
| Memory storage required at server | Low | Low | Medium |
| Memory storage required at user | Medium | Medium | Low |
| Security w.r.t. outsiders | High | High | High |
| Security w.r.t. server | High | High/Medium | Medium |

Comparison of the Three Searching Schemes: In Table I, the proposed three searching schemes are compared in terms of storage, bandwidth requirement, and security. The scale of low, medium and high in Table I only represents relative values; it is intended for comparison purposes and does not signify performance in absolute terms. Each of the three approaches has its advantages and disadvantages, and may be suitable for different scenarios depending on the system constraints. The choice of the most appropriate searching scheme usually depends on the application requirements and user preferences, in consideration of the specific threat model. In the subsequent discussion, techniques developed for each of the three schemes are presented in greater detail. For the baseline scheme, as whole term frequency rows are transmitted from the server to the user during the searching process, compression of term frequencies will be discussed for saving communication bandwidth. For the partially and fully server oriented schemes, one important consideration will be developing appropriate inner-layer encryption algorithms to achieve a good tradeoff between data security, retrieval accuracy, and searching efficiency.

In the baseline model, the data center does not get access to the unencrypted content at any point in time, during either the pre-processing stage or the search and retrieval stage. The data center does not know the term frequency information, as it is stored encrypted. The only information that the data center gains from the search process is the retrieval log. The retrieval log at most contains data on which user searched for what encrypted queries, when and how often. The data center may also learn which documents were requested pertaining to the encrypted search queries. Based on such information collected over a period of time, the data center may launch some kinds of statistical attacks. However, such attacks can be mitigated by the content owner, by adding to his/her requests some phantom terms and document indices to obfuscate the access statistics of his/her intended terms and documents. The content owner can also hide his/her identity by introducing a proxy in his/her connection link with the data center.
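The phantom-term countermeasure can be sketched as follows. The specification does not say how phantom terms are chosen, so this example assumes they are drawn uniformly at random from a known vocabulary and shuffled in with the real terms; the function name `pad_with_phantoms` is hypothetical. In practice the batch would consist of the hashed query values rather than plain-text terms.

```python
import secrets

def pad_with_phantoms(real_terms, vocabulary, n_phantom):
    """Mix the intended query terms with randomly chosen decoy terms.

    real_terms -- terms the content owner actually wants to look up
    vocabulary -- all terms that have a row in the term frequency table
    n_phantom  -- number of decoy terms to add
    """
    rng = secrets.SystemRandom()                       # OS-backed randomness
    real = set(real_terms)
    candidates = [t for t in vocabulary if t not in real]
    phantoms = rng.sample(candidates, min(n_phantom, len(candidates)))
    batch = list(real_terms) + phantoms
    rng.shuffle(batch)                                 # hide which positions are real
    return batch
```

The owner requests rows for every term in the returned batch and silently discards the rows of the phantom terms, so the retrieval log no longer reflects only the intended terms.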

Encoding the term frequency rows helps reduce the bandwidth required for their transmission during the search phase. Value-precision encoding is used herein to compress the term-frequency rows, wherein the position and the value of every non-zero term in the term-frequency table are encoded. As an example, the results with 200,000 e-mails from the Enron e-mail corpus suggest that the average size of the compressed term frequency rows is 435 bytes, and 86% of them have a size within 200 to 300 bytes (see B. Klimt and Y. Yang, “Introducing the Enron Corpus,” Conf. On Email and Anti-Spam (CEAS), Mountain View, Calif., 2004). Thus, by encoding, the bandwidth required for transmitting the term frequency rows can also be reduced.
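By way of illustration only, the position-and-value encoding of non-zero entries described above might be sketched as follows. The function names `encode_row` and `decode_row` and the sample row are hypothetical and are not taken from the disclosed embodiment; any further entropy coding of the pairs is omitted.

```python
def encode_row(row):
    """Keep only (position, value) pairs for the non-zero term frequencies."""
    return [(j, tf) for j, tf in enumerate(row) if tf != 0]


def decode_row(pairs, num_docs):
    """Rebuild the full term-frequency row from its (position, value) pairs."""
    row = [0] * num_docs
    for j, tf in pairs:
        row[j] = tf
    return row


# Hypothetical row: the term occurs in documents 2, 4 and 7 only.
row = [0, 0, 3, 0, 1, 0, 0, 2]
encoded = encode_row(row)
decoded = decode_row(encoded, len(row))
```

Because most entries of a term-frequency row are zero, the list of pairs is typically much shorter than the row itself, which is the source of the bandwidth saving.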

Since computing the relevance score requires the use of cumulative frequency of a word (CFW) as in EQ-1, the CFW can be computed beforehand and encrypted using the same word key as in the term frequency table. The CFW is then stored in the data center separately from the term frequency. It can be sent to the content owner along with the term frequency rows during relevance computation. If the relevance score is computed by the data center, the CFW can be stored in the data center in clear-text form.

Secure Ranking of Document Relevance

The baseline model previously introduced provides secure and effective search for scenarios where the content owner makes a query himself/herself. In this section, two different schemes, namely homomorphic encryption and order-preserving encryption (each discussed in greater detail below), are presented for enabling the search capability from a user other than the content owner. These schemes reduce the involvement of the content owner either partially or completely by shifting the task of computing the relevance score to the data center, while still maintaining the confidentiality of the term frequency information and the document content. To achieve this goal, an additional layer of encryption on the term frequency information is designed. This additional layer of encryption is referred to as the inner-layer encryption. After the inner-layer encryption, TF^((s)) is encoded to obtain TF_(C) ^((s)), and further encrypted to obtain TF_(C) ^((e)) in the same way as in the baseline scheme. This second round of encryption is referred to as outer-layer encryption, which prevents unauthorized users from accessing term frequency information.

FIG. 4 is a diagram illustrating search and retrieval in the fully server oriented scheme according to the invention. The indexing and pre-processing stages of the proposed schemes are similar to the baseline model with an additional inner-layer encryption, and the searching stage is shown in FIG. 4. When searching for a particular query consisting of multiple terms, w(i₁), w(i₂), . . . , w(i_(M)), in the collection, the user first performs stemming to obtain the corresponding stemmed words. The user then sends the stemmed words to the content owner, who checks at location 210 whether the user has the required permission to search for the query words. Upon verification, the content owner derives the word-keys from the master key and uses them to encrypt the stemmed words to obtain w_(S)(i_(k))^((e)), k=1, 2, . . . , M. After that, the hash value of each w_(S)(i_(k))^((e)) is calculated and transmitted to the user, who forwards it to the data center. Using the received hash values H(w_(S)(i_(k))^((e))) from location 212, the data center searches the protected term frequency table TF_(C) ^((e)) at location 214 and identifies the rows corresponding to the query words. In this way, the data center does not get any information about the query.
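The derivation of the row lookup value from a stemmed query word may be sketched, under stated assumptions, as follows. The description does not name a particular key-derivation function, cipher, or hash; in this sketch HMAC-SHA256 is assumed for the word-key derivation, a keyed HMAC is used as a stand-in for the deterministic encryption of the stemmed word, and SHA-256 is assumed for H. All function names are hypothetical.

```python
import hashlib
import hmac


def derive_word_key(master_key: bytes, stemmed_word: str) -> bytes:
    # Assumed derivation: HMAC-SHA256 of the stemmed word under the master key.
    return hmac.new(master_key, stemmed_word.encode("utf-8"), hashlib.sha256).digest()


def encrypt_word(word_key: bytes, stemmed_word: str) -> bytes:
    # Stand-in for deterministic encryption of the stemmed word (not reversible here);
    # a real system would use an actual cipher keyed with the word-key.
    return hmac.new(word_key, b"enc:" + stemmed_word.encode("utf-8"), hashlib.sha256).digest()


def row_lookup_hash(master_key: bytes, stemmed_word: str) -> str:
    # H(w_S^(e)): the only value about the query that reaches the data center.
    word_key = derive_word_key(master_key, stemmed_word)
    encrypted_word = encrypt_word(word_key, stemmed_word)
    return hashlib.sha256(encrypted_word).hexdigest()


# Content owner side: index rows under the hash of each encrypted stemmed word.
master = b"hypothetical-master-key"
table = {row_lookup_hash(master, w): "row for " + w for w in ("search", "retriev")}

# Data center side: the row is located from the hash alone.
found = table.get(row_lookup_hash(master, "search"))
```

The data center only ever compares hash values, so locating a row does not require it to hold the master key or the word-keys.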

After the data center identifies the target rows from the term frequency table TF_(C) ^((e)), it uses the Secure Computing Unit (SCU) to decrypt and decode them at location 216, and subsequently obtains the corresponding rows of the term frequency table TF^((S)) that are protected by the inner-layer encryption algorithms. During this stage, the encrypted rows, TF^((S)), are retained within the SCU and not revealed to the data center. The SCU then performs part of or the entire computation of the relevance scores at location 218 in the encrypted domain, as shown in FIG. 4. In the homomorphic encryption based scheme (HME), the computation results are then sent to the content owner, who decrypts the results, obtains the relevance scores, and rank-orders the documents. Therefore, HME is also referred to as the partially server oriented scheme. The order of the relevant documents pertaining to the user's query is sent back to the data center, which gives the user the corresponding documents at location 220. On the other hand, in the order preserving encryption based scheme (OPE), the entire computational burden is shifted to the SCU, which computes relevance scores, rank-orders the documents, and directly sends back to the user the most relevant documents with their ranking information. The OPE is also referred to as the fully server oriented scheme.

The main difference between the HME and the OPE schemes is the additional round of communication between the data center and the content owner, and the need of using the content owner's decryption key. As discussed below, the need for this additional round of communication can be offset at the cost of slightly reduced retrieval accuracy. In the following sections, details of the OPE and HME schemes are discussed.

Approach/Scheme II: Fully Server Oriented Scheme Based on Order Preserving Encryption

To remove the need for communication between the data center and the content owner during content search, computations and ranking are performed directly on term-frequency data in its inner-encrypted form. Discussed herein are an order preserving encryption scheme (OPE) used as the inner-layer encryption and the method of computing and ranking relevance scores in the encrypted domain.

More specifically, order preserving encryption is applied on TF(i, j) to obtain encrypted TF^((s))(i, j) in the inner-layer encryption step, i.e., if TF(i, j)<TF(i, k), then TF^((s))(i, j)<TF^((s))(i, k). Due to the monotonicity of the relevance score function in EQ-1, as long as the order of relevance scores (or the order of term frequency values) is preserved, rather than their exact values, the correct search results can be obtained for queries that involve only one term. Based on the experimental analysis on the Enron e-mail corpus discussed earlier, generally peaking histograms are observed for the term frequency values over a large number of rows, and some examples are shown in FIGS. 5A and 5B. Applying the existing algorithms of order preserving encryption to such generally peaking distributions would not randomize the term frequency values, since their one-to-one mapping operation largely retains the generally peaking nature of the term frequency distributions, leaking valuable information to the server. Therefore, in order to enhance security and prevent the leak of term-frequency information, an appropriate one-to-many mapping is required to flatten the generally peaking distribution to an approximately uniform distribution and increase its randomness.

In the one-to-many order preserving encryption method, the encryption is performed row by row for each of the N^((TF)) terms. The generally peaking structure of the term frequency distribution reflects that there are a large number of entries having the same term frequency value in an individual row of the term frequency table. In order to flatten the generally peaking distribution, every entry TF(i, j) with the value tf is mapped to a random number in the range [tf^(l),tf^(u)], where 0≦tf^(l)≦tf^(u)<2^(B) (B=8 in the experiment) are the lower bound and the upper bound of the random mapping range, which must be carefully chosen. In order to make the one-to-many mapping an order preserving operation, for two different term frequency values tf₁ and tf₂, their random mapping ranges [tf₁ ^(l),tf₁ ^(u)] and [tf₂ ^(l),tf₂ ^(u)] are chosen to satisfy the following constraint:

if tf₁<tf₂, then tf₁ ^(u)<tf₂ ^(l)  (EQ-3)

To maximize the entropy of the encrypted output, the random mapping range [tf^(l), tf^(u)] for a term frequency value tf is adaptively determined according to the distribution of row term frequency values, so that an approximately uniform distribution can be obtained for the encrypted term frequency values TF^((s))(i, j). More specifically, the width of the random mapping range [tf^(l),tf^(u)] is chosen proportional to the count of tf in that particular row. The values of tf^(l) and tf^(u) are then determined subject to 0≦tf^(l)≦tf^(u)<2^(B) and the constraint in EQ-3. In this way, an approximately uniform distribution can be obtained for the encrypted TF^((s))(i, j) in individual rows.
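A minimal sketch of such a one-to-many order preserving mapping for a single row is given below, assuming B=8. The allocation rule (one guaranteed slot per distinct value, with the remaining slots split in proportion to the counts) and the names `build_ranges` and `encrypt_row` are illustrative assumptions rather than the disclosed algorithm, and Python's `random.Random` is used in place of the keyed, cryptographically secure randomness a real system would require.

```python
import random
from collections import Counter


def build_ranges(row, bits=8):
    """Give each distinct tf value a disjoint range [lo, hi] within [0, 2**bits).

    Ranges are ordered like the tf values (EQ-3) and their widths grow with the
    number of entries holding that value, which flattens the encrypted histogram.
    """
    counts = Counter(row)
    values = sorted(counts)
    size = 2 ** bits
    if len(values) > size:
        raise ValueError("more distinct values than available code points")
    spare = size - len(values)  # slots left after one guaranteed slot per value
    total = len(row)
    ranges = {}
    seen = 0
    for k, value in enumerate(values):
        lo = k + (seen * spare) // total
        seen += counts[value]
        hi = k + (seen * spare) // total
        ranges[value] = (lo, hi)
    return ranges


def encrypt_row(row, ranges, rng):
    """Map every entry to a random point inside the range of its tf value."""
    return [rng.randint(*ranges[tf]) for tf in row]


# Hypothetical peaked row: many entries share the same small tf value.
row = [0] * 50 + [1] * 30 + [2] * 15 + [5] * 5
ranges = build_ranges(row)
encrypted = encrypt_row(row, ranges, random.Random(0))
```

Because consecutive ranges never overlap, any two entries with different term frequencies keep their relative order after encryption, while entries with equal term frequencies are spread over the whole width of their range.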

FIGS. 5A and 5B, briefly discussed above, are examples of term frequency histograms, and FIGS. 5C and 5D are the corresponding histograms of the encrypted term frequency values. Applying the proposed random mapping method to the two histograms shown in FIGS. 5A and 5B, with the random mapping range determined for individual rows, encrypted TF^((s))(i, j) is obtained with the histograms shown in FIGS. 5C and 5D, respectively. It can be seen that approximately uniform distributions are obtained after the one-to-many order preserving encryption, even though the distributions of row term frequency values are quite different in these two examples. This indicates that the confidentiality of critical term frequency information can be protected from hackers, unauthorized users, and the data center that carries out the search task.

Approach/Scheme III: Partially Server Oriented Scheme Using Homomorphic Encryption

In the partially server oriented scheme discussed herein, after the term frequency rows TF_(C) ^((e))(i₁,.), TF_(C) ^((e))(i₂,.), . . . , TF_(C) ^((e))(i_(M),.) go through outer-layer decryption and decompression, the server performs part of the computation, the result of which is then sent back to the user side for obtaining the final relevance scores. The basis for the partially server oriented scheme is that in some scenarios, such as that of a mobile computing unit, the computation power of the client and the bandwidth of the communication channel may be severely limited, and the MedSCU can help perform certain computations in a secure manner. Hence, the amount of data transferred between the client and server and the amount of computation to be performed by the client should be minimized.

FIG. 6 is a diagram illustrating the partially server oriented scheme according to the invention. As shown in FIG. 6, when searching for a particular word w in the database, the user side first performs stemming at location 240 to obtain its corresponding stemmed word w_(S). The word-key is then derived from the master key and used to encrypt the stemmed word w_(S) to obtain w_(S) ^((e)) at location 242. After that, the hash value of w_(S) ^((e)) is calculated at location 244 and transmitted to the server side. Using the received hash value H(w_(S) ^((e))), the server can search the protected term frequency table TF_(C) ^((e)) at location 246 and identify the row corresponding to the query word w.

After the server identifies the target row TF_(C) ^((e))(k,.) at location 246 from the term frequency table TF_(C) ^((e)), in the partially server oriented scheme, the server itself decrypts and decompresses it at locations 248, 250 and subsequently obtains term frequencies TF^((s))(k,.) that are protected with inner-layer encryption algorithms. The server then performs part of or all the computation at location 252 in finding the relevance scores in the encrypted domain. After that, the server sends the computation results back to the user side at location 254, which then decrypts the received results and further rank-orders the documents. The encrypted documents are then obtained at location 256, and returned to the user at location 258 for decryption.

In further detail, for the partially server oriented scheme, for a query submitted by the user, the server first extracts the corresponding term-frequency rows stored in the encrypted format. For each of the identified rows, TF_(C) ^((e))(i,.), the server decrypts it using the word key and then decompresses it to obtain TF^((s))(i,.) with an inner-layer encryption. Then, in this encrypted domain, at location 252 as discussed above, the server performs certain computations toward finding the relevance scores. The computation results are then sent back to the user, who uses the decryption keys to find the actual values of the relevance scores at location 254. The user then rank-orders the documents using the derived relevance scores and requests the most pertinent documents from the server at location 256. Similar to the baseline scheme, the partially server oriented scheme also involves two rounds of communication. In the first round, the user sends the query word(s) and gets the encrypted relevance scores from the server. The user then processes the results to find the relevant documents and requests the documents in the second round. Unlike the baseline scheme, this method does not require transmission of all term frequency files related to a query. Therefore, it needs much lower bandwidth in the searching process and would be feasible for low-bandwidth scenarios.

When the server performs the computation of relevance scores, it works on term frequencies TF^((s))(i,.) with an inner-layer encryption. Therefore, the security of the term frequency information with respect to the server itself largely depends on the nature of the inner-layer encryption. Meanwhile, computation results on TF^((s))(i,.) should benefit the user side in the subsequent sorting of final relevance scores. In the following, it is shown that homomorphic encryption algorithms may be used to encrypt the term-frequency values to enable performing arithmetic computations in the encrypted domain.

Secure Computation of Relevance Scores Based on Homomorphic Encryption

Generally, when the SCU performs the computation of relevance scores, it works on term frequency rows, TF^((s))(i,.), encrypted with an inner-layer encryption. Therefore, the security of the term frequency information with respect to the SCU itself largely depends on the nature of the inner-layer encryption. Meanwhile, computation results on TF^((s))(i,.) should benefit the content owner in the subsequent sorting of final relevance scores. Homomorphic encryption algorithms may be used to encrypt the term-frequency values to enable performing arithmetic computations in the encrypted domain (see J. Domingo-Ferrer, “A New Privacy Homomorphism and Applications,” Information Processing Letters, Vol. 60, No. 5, pp. 277-282, December 1996, and R. L. Rivest, L. Adleman, and M. L. Dertouzos, “On Data Banks and Privacy Homomorphisms,” Foundations of Secure Computation, Academic Press, 1978, pp. 169-179). The RSA encryption and symmetric homomorphism schemes that may be used will now be discussed in detail.

RSA Based Approach

The RSA public-key cryptosystem involves a public key (n, e) and a private key (n, d) such that e d≡1 (mod φ(n)), where φ is Euler's totient function. A message m∈Z_(n)={0, 1, 2, . . . , n−1} is encrypted using the public key (n, e) as c=RSA(m)=m^(e) (mod n). The message can then be recovered using the private key (n, d) as m=c^(d) (mod n). The RSA encryption scheme has the following property:

$\begin{matrix}{\left( {{RSA}\left( m_{1} \right) * {RSA}\left( m_{2} \right)} \right) \bmod n = \left( \left( m_{1}^{e} \bmod n \right) * \left( m_{2}^{e} \bmod n \right) \right) \bmod n = \left( m_{1} m_{2} \right)^{e} \bmod n = {RSA}\left( m_{1} * m_{2} \right).} & \left( {{EQ}\text{-}4} \right)\end{matrix}$
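The multiplicative property in EQ-4 can be checked numerically with a toy key. The primes, exponent, and messages below are illustrative values chosen for readability and are far too small for any real use; unpadded ("textbook") RSA is assumed, since padding would destroy the homomorphic property.

```python
# Toy RSA key (illustrative values only).
p, q = 61, 53
n = p * q                  # modulus
phi = (p - 1) * (q - 1)    # Euler's totient of n
e = 17                     # public exponent, coprime with phi
d = pow(e, -1, phi)        # private exponent (modular inverse, Python 3.8+)


def rsa_encrypt(m):
    return pow(m, e, n)


def rsa_decrypt(c):
    return pow(c, d, n)


m1, m2 = 7, 12
# Multiply the two ciphertexts without ever decrypting them.
product_cipher = (rsa_encrypt(m1) * rsa_encrypt(m2)) % n
recovered = rsa_decrypt(product_cipher)   # equals m1 * m2 as long as m1 * m2 < n
```

The product of the plaintexts must stay below n for the decrypted value to equal the true product; otherwise only its residue modulo n is recovered.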

This homomorphic property is used to perform relevance score computations at the server's end. To facilitate easy computations in the encrypted domain, the relevance score defined in EQ-1 is approximated as follows:

$\begin{matrix}{\begin{matrix}{{{{CW}\left( {i,j} \right)} \approx \frac{{{CFW}(i)}{{TF}\left( {i,j} \right)}\left( {K_{1} + 1} \right)}{K_{1}}},} \\{{= {C(i){{TF}\left( {i,j} \right)}}},}\end{matrix}{where}} & \left( {{EQ}\text{-}5} \right) \\{{C(i)} = \frac{{{CFW}(i)}\left( {K_{1} + 1} \right)}{K_{1}}} & \left( {{EQ}\text{-}6} \right)\end{matrix}$

and can be calculated with the knowledge of the number of documents that do not contain the i^(th) word. In arriving at EQ-5, the TF(i, j) term is ignored in the denominator of EQ-1 and it is assumed that NDL(j)≈1, i.e., the length of all documents is approximately the same and equal to the average length. Although ignoring the TF(i, j) term in the denominator changes the actual value of CW(i, j), the relative order for a single term is still preserved, as both functions are monotonic in TF(i, j). For queries containing multiple terms, EQ-5 is used to compute the relevance score for document D(j) for each word in the query, CW(i₁, j), CW(i₂, j), . . . , CW(i_(M), j), and the final relevance score is calculated by

CW(j)=CW(i ₁ ,j)+CW(i ₂ ,j)+ . . . +CW(i _(M) ,j)  (EQ-7)

TABLE II
Evaluation of the Retrieval Results using the Simplified Relevance Score in EQ-5

           Number of Search Terms
Ranks        1      2      3      5
Top 10      10     10      9      7
Top 20      20     20     20     18
Top 50      50     50     50     48
Top 100    100    100    100    100

To evaluate the performance of the search method using the approximation in EQ-5, the documents ranked in the top 10, top 20, etc. using the original Okapi score are counted and compared with those obtained using the score calculated with EQ-5. Table II shows the results obtained. It should be noted that the approximation does not affect the performance of the retrieval system when searching for a small number of query terms, and that the performance degrades gradually as the number of query terms increases. This supports the use of the approximation in EQ-5.

While creating the database, each row of the term frequency table, TF(i,.), is first encoded using RSA to obtain TF^((s))(i,.)=RSA(K_(i) ^((s)),TF(i,.)). The encrypted row is then compressed and encrypted again using a symmetric encryption function E and key K_(i) ^((TF)) to obtain TF^((e))(i,.)=E(K_(i) ^((TF)),TF_(C) ^((s))(i,.)), which is stored in the database. The encrypted value of C(i), C^((s))(i)=RSA(K_(i) ^((s)), C(i)), is also stored.

In the searching phase, the client sends the query terms and the corresponding keys K_(i_(m)) ^((TF)), m=1, 2, . . . , M, to the server. For computing the relevance score CW(i_(m), j), TF^((e))(i_(m),.) is decrypted using the decryption function D and key K_(i_(m)) ^((TF)) and decompressed to obtain TF^((s))(i_(m),.). The server then performs the following computation to obtain the encrypted values of the relevance scores

RSA(K_(i_(m)) ^((s)),CW(i_(m),j))=RSA(K_(i_(m)) ^((s)),C(i_(m)))*RSA(K_(i_(m)) ^((s)),TF(i_(m),j)) (mod n)  (EQ-8)

The server then returns RSA(K_(i_(m)) ^((s)),CW(i_(m),.)), m=1, 2, . . . , M, to the client, which decrypts, sums, and sorts the scores. The client then requests the relevant files from the server.
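The division of labor in the RSA based approach may be sketched as follows, using the approximation CW(i, j)≈C(i)TF(i, j). For brevity the sketch assumes a single toy key pair for all rows (the description uses a per-row key K_(i) ^((s))), omits the outer-layer encryption and compression, and uses integer-valued constants C(i); the term names, table contents, and function names are hypothetical.

```python
# Toy unpadded RSA key shared by all rows (an assumption made for brevity).
P, Q, E = 61, 53, 17
N = P * Q
D = pow(E, -1, (P - 1) * (Q - 1))


def enc(m):
    return pow(m, E, N)


def dec(c):
    return pow(c, D, N)


# Pre-processing by the content owner: encrypt term frequencies and constants C(i).
tf_plain = {"alpha": [3, 0, 1], "beta": [1, 2, 0]}   # rows: terms, columns: documents
c_plain = {"alpha": 4, "beta": 9}                    # integer-valued C(i)
tf_enc = {t: [enc(v) for v in row] for t, row in tf_plain.items()}
c_enc = {t: enc(c) for t, c in c_plain.items()}


def server_scores(query):
    """Server side: multiply ciphertexts term by term, without any decryption."""
    return {t: [(c_enc[t] * x) % N for x in tf_enc[t]] for t in query}


def client_rank(encrypted):
    """Client side: decrypt the per-term scores, sum them per document, and sort."""
    num_docs = len(next(iter(encrypted.values())))
    totals = [sum(dec(row[j]) for row in encrypted.values()) for j in range(num_docs)]
    ranking = sorted(range(num_docs), key=lambda j: totals[j], reverse=True)
    return totals, ranking


totals, ranking = client_rank(server_scores(["alpha", "beta"]))
```

One encrypted score row is returned per query term because only the products, and not their sum over terms, can be formed on ciphertexts under RSA.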

The RSA based scheme has the advantage that the relevance scores are computed on the server without sacrificing security. However, the amount of data that needs to be transferred to the client is still proportional to the number of terms in the query. This is due to the fact that the only operation that is homomorphic in RSA is multiplication, which limits the operations that can be performed on the server without sacrificing security. To overcome this limitation, a scheme based on a symmetric homomorphic encryption algorithm may be utilized, as discussed below.

Symmetric Homomorphism Based Approach

A key-dependent homomorphic encryption algorithm gK, with key K, operating on data items x₁ and x₂, satisfies gK(x₁+x₂)=gK(x₁)+gK(x₂), gK(x₁*x₂)=gK(x₁)*gK(x₂), and gK(x₁*c)=c*gK(x₁) for any constant c. Thus, the function gK is homomorphic with respect to addition and multiplication operations. Division can then be performed by treating it as an operation on rational numbers, and the numerator and denominator terms can be computed separately as follows:

$\begin{matrix}{{g\left( {\frac{x_{1}}{x_{2}} + \frac{x_{3}}{x_{4}}} \right)} = \frac{{{g\left( x_{1} \right)}*{g\left( x_{4} \right)}} + {{g\left( x_{2} \right)}*{g\left( x_{3} \right)}}}{{g\left( x_{2} \right)}*{g\left( x_{4} \right)}}} & \left( {{EQ}\text{-}9} \right)\end{matrix}$

These properties can be used to efficiently compute the relevance scores. Referring to EQ-1, the Okapi relevance score can now be written as follows:

$\begin{matrix}{{{CW}\left( {i,j} \right)} = {\frac{{{TF}\left( {i,j} \right)}{C_{1}(i)}}{{{TF}\left( {i,j} \right)} + {C_{2}(j)}} = \frac{{Num}\left( {i,j} \right)}{{Den}\left( {i,j} \right)}}} & \left( {{EQ}\text{-}10} \right)\end{matrix}$

where C₁(i)=(K₁+1)CFW(i) and C₂(j)=K₁(1−b+b×NDL(j)).
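In the clear, the score of EQ-10 can be evaluated as in the following sketch. The default values k1=1.2 and b=0.75 are commonly used Okapi parameters assumed here for illustration (the description does not fix them), and the function name is hypothetical.

```python
def okapi_score(tf, cfw, ndl, k1=1.2, b=0.75):
    """CW(i, j) = TF(i, j) * C1(i) / (TF(i, j) + C2(j)), following EQ-10."""
    c1 = (k1 + 1) * cfw               # C1(i): depends only on the term
    c2 = k1 * (1 - b + b * ndl)       # C2(j): depends only on the document length
    return tf * c1 / (tf + c2)


# Scores for increasing term frequency in a document of average length (NDL = 1).
scores = [okapi_score(tf, cfw=1.0, ndl=1.0) for tf in range(5)]
```

For a fixed term and document length, the score increases strictly with the term frequency and saturates toward C₁(i), which is why preserving the order of term frequencies suffices for single-term queries.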

In the pre-processing stage, the content owner encodes each row of the term frequency table TF(i,.) separately using homomorphic encryption to obtain TF^((s))(i,.)=gK(TF(i,.)), and these results are used in the search stage. The values of the constants C₁(i) and C₂(j) are also computed and stored along with the encrypted term frequency rows TF^((e))(i,.). In the search phase, suppose that a query contains the terms w(i₁), w(i₂), . . . , w(i_(M)); for each term in the query, the SCU decrypts and decodes the corresponding term frequency row to obtain TF^((s))(i_(m),.). It then obtains the numerator and denominator of gK(CW(i_(m), j)) for each query term using

gK(Num(i _(m) ,j))=C ₁(i _(m))*gK(TF(i _(m) ,j))  (EQ-11)

gK(Den(i _(m) ,j))=C ₂(j)+gK(TF(i _(m) ,j))  (EQ-12)

The overall encrypted value of the relevance score, gK(CW(j)), is then obtained by adding the relevance scores in the encrypted domain and can be shown to be

$\begin{matrix}{{g\; {\kappa \left( {{CW}(j)} \right)}} = \frac{\sum\limits_{m = 1}^{M}{g\; {\kappa \left( {{Num}\left( {i_{m},j} \right)} \right)}{\prod\limits_{\underset{n \neq m}{n = 1}}^{M}\; {g\; {\kappa \left( {{Num}\left( {i_{n},j} \right)} \right)}}}}}{\prod\limits_{m = 1}^{M}\; {g\; {\kappa \left( {{Den}\left( {i_{m},j} \right)} \right)}}}} & \left( {{EQ}\text{-}13} \right)\end{matrix}$

In the absence of the decryption key, the exact value of the relevance score cannot be computed by the SCU, and the numerator and denominator of gK(CW(j)) are sent to the content owner/supervisor. The content owner decrypts them with the secret key to obtain the actual numeric values of Num(j) and Den(j) and computes the relevance score for each document. The content owner then sorts the relevance scores and sends the list of relevant documents to the data center, which retrieves them from its collection for the user.

Comparison of RSA and Homomorphic Encryption Approaches

The proposed symmetric homomorphic encryption based scheme has the advantage that the amount of data transferred between the server and the client is independent of the number of terms in the query. Also, the amount of computation that has to be performed on the client side is reduced by shifting most of the computation to the server side. However, this necessitates that the keys K_(i) ^((s)) used for encrypting the rows of the term frequency table TF(i,.) be the same. In contrast, the RSA based scheme does not require that the keys used for encrypting the rows of the term frequency table be the same. The consequence is a relatively larger amount of data that needs to be transferred from the server to the client. Thus, depending on the usage scenario, the user may choose one of the two options.

RESULTS/DISCUSSION

The homomorphic encryption (HME) scheme, the order-preserving encryption (OPE) scheme, and the baseline model will now be compared in terms of security and retrieval accuracy, and the tradeoffs involved in securing the term frequency using order preserving encryption will be examined. The retrieval accuracies of the secure search schemes are evaluated on the W3C collection, using the 59 queries used for the discussion search in the enterprise track of the 2005 Text Retrieval Conference (TREC). Any document that is judged partially relevant or relevant is taken to be relevant (i.e., conflating the top two judgment levels). In terms of retrieval accuracy, the performance of the HME scheme should be identical to that of the baseline model, as it also has the accurate term frequency information to compute the relevance score.

The performance of the proposed schemes is discussed using precision-recall graphs. The precision-recall results for all 59 queries are collected and the average performance is shown in FIG. 7, which shows that the retrieval accuracy of the OPE is slightly lower than that of the baseline scheme. However, this slight drop in performance in OPE comes with the added advantage of fewer communication rounds compared with the HME and the baseline schemes.

TABLE III
Retrieval Accuracy Measures for Various Schemes

Metric    Baseline    OPE
MAP       0.3739      0.3142
r-prec    0.3878      0.3476
bpref     0.3798      0.3412
P@5       0.5424      0.5017
P@10      0.4881      0.4627
P@20      0.4271      0.3839
P@30      0.3791      0.3271
P@100     0.2366      0.2056
P@1000    0.0471      0.0422
RR1       0.7257      0.6749

The search-retrieval accuracy of the proposed schemes is also examined using a set of common evaluation metrics discussed in N. Craswell, A. P. de Vries, and Ian Soboroff, “Overview of the TREC-2005 Enterprise Track,” Text Retrieval Conference, 2005, and “Common Evaluation Measures,” Appendix to the Proceedings of Text Retrieval Conference, 2005. The evaluation results are shown in Table III. Comparing the values in Table III with the results published in the “Overview of the TREC-2005 Enterprise Track” document, the baseline scheme using the Okapi relevance score would have been ranked second in the evaluation, suggesting that the retrieval accuracy of the baseline scheme is comparable to the state of the art among information retrieval systems that do not take security issues into account. With regard to the OPE, even with the added layer of security, its performance would have placed it among the top five search retrieval schemes evaluated in the TREC 2005 conference.
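For reference, two of the measures of the kind reported above, precision at rank k (P@k) and the average precision that is averaged over queries to give MAP, may be computed as in the following sketch. The document identifiers and relevance judgments are hypothetical and are not drawn from the W3C experiments.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k ranked documents that are judged relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def average_precision(ranked, relevant):
    """Mean of the precision values at the ranks where relevant documents appear,
    taken over all relevant documents (unretrieved ones contribute zero)."""
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


# Hypothetical ranked list and relevance judgments for one query.
ranked = ["d3", "d1", "d7", "d2", "d5"]
relevant = {"d1", "d2", "d9"}
p_at_5 = precision_at_k(ranked, relevant, 5)
ap = average_precision(ranked, relevant)
```

MAP is then the mean of `average_precision` over all evaluated queries.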

By introducing the order-preserving encryption on row term frequency values, the OPE enables document search on the data center side while preventing it from learning the critical term frequency information. When a query contains a single term, the OPE can achieve a search as effective as the baseline model by accurately identifying the target documents. This is because the order of term frequency values is preserved after the inner-layer encryption, and the relevance score is a strictly increasing function of the term frequency. As the number of terms in a query increases, the order may not be completely preserved when summing up the scores of all terms. To examine the search accuracy for multiple terms, FIG. 8 shows the differences in the Mean Average Precision (MAP) between the baseline scheme and the order-preserving encryption scheme for different numbers of search terms. As the majority of queries in the W3C experiments, for which the ground truth is available, include 2 to 4 terms, the search accuracy is examined and compared with the number of searched terms within this range. With multiple terms in a query, the accuracy of OPE is within only a small gap of that of the baseline model. Thus, within this range, the number of search terms in the query does not significantly affect the performance of the OPE scheme. These results show that the OPE scheme is capable of effectively processing multiple-term queries while maintaining confidentiality of the content statistics.

FIG. 9 shows a scatter plot of the Mean Average Precision (MAP) values for the fully server oriented (FSO) scheme plotted with respect to the baseline scheme for the 59 search queries in the W3C database. The figure shows strong correlation, with the slope of the best linear fit close to 1, indicating that there is no significant reduction in performance for the FSO scheme compared to the baseline scheme.

As shown in FIG. 10, to compare the ranking accuracies, the modified Kendall distance measure proposed in “Common Evaluation Measures,” Appendix to the Proceedings of Text Retrieval Conference, 2005, is used to compare the top 20 and top 100 ranks obtained using the baseline scheme and the FSO scheme. The distance between the top 20 ranks for the FSO scheme and the baseline scheme is approximately 0.42 and the corresponding value for the top 100 ranks is approximately 0.29. The distance for the top 20 ranks is higher because the random mapping may change the order of the top 20 ranks. However, for the top 100 ranks the distance is lower because most of the top 100 documents are common between the two lists.

Certain aspects of the proposed framework, as related to security, storage efficiency, search accuracy, and system complexity, will now be discussed. If efficient storage of term frequency is needed, the inner-layer encryption in HME and OPE would have to retain the sparsity of the TF table by keeping the zero-valued terms. Thus the SCU may gain knowledge of the zero-valued TF entries, without knowing which plain-text term and which document these correspond to. The proposed schemes require a secure environment to initially generate the encrypted indices and encrypted documents. Usually such initial processing is required only once. However, in the case when the collection is constantly changing, such as by adding more documents or changing the contents of existing documents, the secure index information in HME and OPE should also be updated. For the OPE scheme, the mapping of frequency values for all terms that appear in the new/changed documents should be updated to ensure security and search accuracy. In such cases, the cost of maintaining a secure search system can be relatively high. One method of addressing such incremental changes to the encrypted TF without a complete update would be to encrypt each document separately, instead of encrypting the documents together. By doing so, while accuracy is slightly reduced due to the different encryption for the different documents, the documents can nevertheless be updated as needed.

The invention thus provides a new framework for secure and confidentiality-preserving search and retrieval in large scale document collections, and techniques for securely rank-ordering the documents and extracting the most relevant documents from an encrypted collection based on the encrypted search queries. The baseline, fully server oriented, and partially server oriented schemes maintain the confidentiality of the query as well as the content of retrieved documents. The confidentiality preserving system and method described herein are secure (relying on secure cryptographic encryption and hashing algorithms), accurate (comparable to conventional searching systems working with unencrypted data), and efficient (in terms of computational complexity and communication bandwidth), as demonstrated by experiments with the W3C collection (discussed above). The confidentiality preserving system and method have a wide range of applications, such as searching information with hierarchical access control, flexible “e-discovery” practices for digital records in legal proceedings, a variety of multi-media applications, image/video searching, and finger-print matching.

Although several embodiments of this invention have been described above with a certain degree of particularity, those skilled in the art may make numerous alterations to the disclosed embodiments without departing from the scope of this invention. All directional references (e.g., upper, lower, upward, downward, left, right, leftward, rightward, top, bottom, above, below, vertical, horizontal, clockwise and counterclockwise) are only used for identification purposes to aid the reader's understanding of the present invention, and do not create limitations, particularly as to the position, orientation, or use of the invention. Joinder references (e.g., attached, coupled, connected, and the like) are to be construed broadly and may include intermediate members between a connection of elements and relative movement between elements. As such, joinder references do not necessarily imply that two elements are directly connected and in fixed relation to each other. It is intended that all matter contained in the above description or shown in the accompanying drawings shall be interpreted as illustrative only and not as limiting. Changes in detail or structure may be made without departing from the invention as defined in the appended claims.

1. A confidentiality preserving system for performing a rank-ordered search and retrieval of contents of a data collection, the system comprising: at least one computer system including a search and retrieval algorithm using at least one of term frequency and similar features for rank-ordering selective contents of the data collection, and enabling secure retrieval of the selective contents based on the rank-order.
2. A confidentiality preserving system according to claim 1, wherein the search and retrieval algorithm generates a relevance score for the rank-ordering based on at least one query.
3. A confidentiality preserving system according to claim 2, wherein at least one of the data collection and query are encrypted.
4. A confidentiality preserving system according to claim 1, wherein the data collection includes at least one of documents and multi-media content.
5. A confidentiality preserving system according to claim 1, wherein the search and retrieval algorithm includes at least one of a baseline algorithm, a partially server oriented algorithm, and a fully server oriented algorithm.
6. A confidentiality preserving system according to claim 5, wherein the baseline algorithm includes a pre-processing algorithm for building a secure term frequency table and an inverse data collection frequency table, and a search stage algorithm for the rank-ordering in response to a query.
7. A confidentiality preserving system according to claim 6, wherein the pre-processing algorithm includes stemming of selective components of the contents of the data collection and mapping of the stemmed components in the term frequency table.
8. A confidentiality preserving system according to claim 7, wherein the selective components are words, and the data collection contents are documents.
9. A confidentiality preserving system according to claim 6, wherein the search stage algorithm includes stemming of a query term, searching of the term frequency table, generation of a relevance score, rank ordering of the selective contents of the data collection based on the relevance score, and retrieval of the selective contents of the data collection based on the rank order.
10. A confidentiality preserving system according to claim 6, wherein the pre-processing and search stage algorithms are executed at a user site remote from a data center for storing the data collection.
11. A confidentiality preserving system according to claim 5, wherein the partially server oriented algorithm includes performance of selective computations at a user site remote from a data center for storing the data collection.
12. A confidentiality preserving system according to claim 5, wherein the partially server oriented algorithm includes at least one of building of a term frequency table and generation of a relevance score at a user site remote from a data center for storing the data collection.
13. A confidentiality preserving system according to claim 5, wherein the fully server oriented algorithm includes building of a term frequency table at a user site and generation of a relevance score at a secure computing unit in a data center for storing the data collection.
14. A confidentiality preserving system according to claim 5, wherein at least one of the partially and fully server oriented algorithms use at least one of homomorphic encryption and order-preserving encryption for enabling search capability from a user other than an owner of the contents of the data collection.
15. A confidentiality preserving method for performing a rank-ordered search and retrieval of contents of a data collection, the method comprising: using at least one of term frequency and similar features for rank-ordering selective contents of the data collection; and securely retrieving the selective contents based on the rank-order.
16. A confidentiality preserving method according to claim 15, further comprising generating a relevance score for the rank-ordering based on at least one query.
17. A confidentiality preserving method according to claim 16, further comprising encrypting at least one of the data collection and query.
18. A confidentiality preserving method according to claim 15, wherein the data collection includes at least one of documents and multi-media content.
19. A confidentiality preserving method according to claim 15, further comprising building a secure term frequency table and an inverse data collection frequency table by stemming of selective components of the contents of the data collection and mapping of the stemmed components in the term frequency table.
20. A confidentiality preserving method according to claim 15, further comprising stemming of a query term, searching of a term frequency table, generation of a relevance score, rank ordering of the selective contents of the data collection based on the relevance score, and retrieval of the selective contents of the data collection based on the rank order.
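
By way of non-limiting illustration only, the pre-processing and search stages recited in claims 6 to 9, 19 and 20 may be sketched as follows. This is a minimal sketch under stated assumptions and not the disclosed implementation: the function names, the sample documents, the key, the naive suffix-stripping stemmer, the use of HMAC-SHA-256 as the keyed hash for term identifiers, and the tf × log(N/df) relevance score are all assumptions introduced for the example. The sketch hides only the term identifiers. Encryption of the term frequency values and of the documents themselves is omitted.

```python
import hashlib
import hmac
import math
from collections import Counter, defaultdict


def naive_stem(word):
    """Hypothetical stand-in for a real stemmer: lowercase, strip a few suffixes."""
    word = word.lower()
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def term_token(key, stem):
    """Keyed hash of a stemmed term, so the tables hold no plaintext words."""
    return hmac.new(key, stem.encode("utf-8"), hashlib.sha256).hexdigest()


def build_tables(key, documents):
    """Pre-processing stage: term frequency table and inverse document frequency table."""
    tf_table = defaultdict(dict)  # token -> {doc_id: term count}
    for doc_id, text in documents.items():
        counts = Counter(term_token(key, naive_stem(w)) for w in text.split())
        for token, count in counts.items():
            tf_table[token][doc_id] = count
    n_docs = len(documents)
    # Assumed IDF form: log(N / number of documents containing the term).
    idf_table = {
        token: math.log(n_docs / len(postings))
        for token, postings in tf_table.items()
    }
    return dict(tf_table), idf_table


def search(key, query, tf_table, idf_table, top_k=2):
    """Search stage: stem the query, look up the table, score, and rank-order."""
    scores = defaultdict(float)
    for word in query.split():
        token = term_token(key, naive_stem(word))
        for doc_id, tf in tf_table.get(token, {}).items():
            scores[doc_id] += tf * idf_table[token]
    # Highest relevance score first; ties are broken by document identifier.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [doc_id for doc_id, _ in ranked[:top_k]]


# Hypothetical sample data for illustration only.
key = b"example-secret-key"
documents = {
    "d1": "encrypted search ranking ranking",
    "d2": "searching encrypted documents",
    "d3": "weather report",
}
tf_table, idf_table = build_tables(key, documents)
top_documents = search(key, "ranking searching", tf_table, idf_table)
print(top_documents)
```

In this sketch both stages run at the same site, which corresponds most closely to the baseline arrangement of claim 10. The partially and fully server oriented arrangements would instead move some or all of the scoring to the data center and would require the homomorphic or order-preserving encryption of claim 14, which the sketch does not attempt.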